This project is an exploratory analysis of the wineQualityReds dataset using R. The analysis uses univariate, bivariate and multivariate plots of the variables to find the properties that affect the quality of wine.
## [1] "/home/sumit/Desktop/data_analyst_nanodegree/EDA_R_P4_f"
Since new variables need to be added to the data, a copy of it is created to make the analysis easier.
## [1] 1599 12
The working copy contains 12 variables and a total of 1,599 observations; the raw file has a 13th column, X, which is only a row identifier.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
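As a rough sketch of how a working copy with this structure might be produced, the block below round-trips a tiny toy data frame through a temporary CSV file. The three rows, the file path and the object names `wine_raw` and `wine` are illustrative assumptions, and dropping the `X` column is only one way to arrive at the 12 columns reported above.

```r
# Toy stand-in for wineQualityReds.csv: three rows, three columns (assumption).
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2),
  alcohol       = c(9.4, 9.8, 9.8),
  quality       = c(5L, 5L, 6L)
)

csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path)               # row names become an unnamed first column

wine_raw <- read.csv(csv_path)         # read.csv() names that column "X"
wine <- subset(wine_raw, select = -X)  # working copy without the identifier

dim(wine_raw)  # one more column than the working copy
dim(wine)
str(wine)
```

Keeping `wine_raw` untouched means new variables can be added to `wine` without altering the data as loaded.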
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The mean wine quality is 5.636 and the median is 6, so the two are close.
The histograms show that pH, density and quality are approximately normally distributed, while several other variables are right-skewed, with most values concentrated at the low end. The sulphur-related variables, chlorides and residual sugar have outliers. Citric acid has the largest number of zero values.
Quality takes 6 distinct integer values in the dataset, from 3 to 8, and most wines are rated 5 or 6.
Converting the quality variable to a factor for better plots.
Converting it to a factor variable makes the analysis easier to run.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## low medium highest
## 63 1319 217
The wine quality is converted into a rating of low, medium or highest for easier analysis.
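A minimal sketch of the two conversions, assuming the grouping 3-4, 5-6 and 7-8 for the three ratings and using a handful of toy quality scores. The object names `quality_f` and `rating` are illustrative and not taken from the original code.

```r
# Toy quality scores covering the observed range 3 to 8 (assumption).
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# Ordered factor, so plots and tables keep the natural order of the scores.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Three-level rating: (2,4] -> low, (4,6] -> medium, (6,8] -> highest.
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("low", "medium", "highest"))

table(rating)
```

Because `cut()` uses right-closed intervals by default, a score of 4 falls in "low" and a score of 6 in "medium".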
Sulphates, chlorides and residual sugar show the largest number of outliers.
These variables are log-transformed so that their distributions become closer to normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
After the transformation, the distribution of chlorides is approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
After the transformation, the distribution of residual sugar is also approximately normal.
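The effect of a log transformation on a right-skewed variable can be illustrated as follows. The data here are synthetic log-normal draws, not the wine measurements, and the `skewness()` helper is a hypothetical function defined only for this sketch.

```r
set.seed(1)

# Synthetic right-skewed variable standing in for chlorides (assumption).
x <- rlnorm(1000, meanlog = -2.5, sdlog = 0.5)

# Simple moment-based sample skewness (helper defined for this sketch).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # close to zero after the transformation

c(raw = skew_raw, log10 = skew_log)
```

Passing `log10(x)` instead of `x` to `hist()` gives the more symmetric histogram; an alternative with the same visual effect is to keep the raw values and put the x-axis on a log scale.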
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
There are some outliers in pH.
Box plot of citric acid before removing NULL values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Box plot of citric acid after removing NULL values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
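The box plot and summary statistics are essentially unchanged after removing the null values, so the removal step dropped few or no observations. The zeros in citric acid (the minimum is still 0.000) are recorded values rather than missing entries.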
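One way to tell missing entries apart from genuine zeros is to count them separately. The sketch below uses only the first ten citric.acid values shown in the str() output earlier, so the counts apply to that small sample and not to the full dataset.

```r
# First ten citric.acid values from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_missing <- sum(is.na(citric))  # entries that are really missing
n_zero    <- sum(citric == 0)    # recorded values that happen to be zero

# Dropping NA values changes nothing when there are none to drop.
cleaned <- citric[!is.na(citric)]
same_summary <- identical(summary(citric), summary(cleaned))

c(missing = n_missing, zero = n_zero)
same_summary
```

If `n_missing` is zero, the before and after summaries will match exactly, which is the pattern seen in the two summaries above.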
There are 1,599 wine observations and 13 numeric variables in total. X is the unique identifier, and fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality are the other 12 features.
Quality is the output variable and all the others are input variables.
The main feature of interest is quality. The aim is to find out how the other variables influence the quality of the wine.
The following variables will support the investigation, as they appear to have an interesting effect on the quality of wine: 1. alcohol 2. sulphates 3. citric.acid 4. volatile.acidity
Yes, the outcome variable is converted into rating levels (low, medium and highest) for better analysis of the data.
The distributions of chlorides and residual sugar are highly right-skewed, so a log transformation is applied to bring them closer to normal. The quality variable is also converted to a factor to make its analysis easier.
Calculating the relationships between variables using correlation values.
The correlation coefficients help determine the strength of the bivariate relationships. Alcohol has the strongest correlation with quality, followed by volatile acidity, sulphates and citric acid.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Features that are positively correlated with quality: alcohol (0.48), sulphates (0.25), citric.acid (0.23), fixed.acidity (0.12), residual.sugar (0.01).
Features that are negatively correlated with quality: volatile.acidity (-0.39), total.sulfur.dioxide (-0.19), density (-0.17), chlorides (-0.13), pH (-0.06), free.sulfur.dioxide (-0.05).
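A ranking like the one above can be read off the quality column of a correlation matrix. The following sketch does this on a small synthetic data frame, since the real data are not reproduced here; the simulated coefficients and the names `toy`, `cor_mat` and `with_quality` are assumptions for illustration.

```r
set.seed(42)
n <- 200

# Synthetic stand-ins for two predictors and the quality score (assumption).
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.6))

toy <- data.frame(alcohol, volatile.acidity, quality)

cor_mat <- cor(toy)                    # Pearson correlation matrix
with_quality <- cor_mat[, "quality"]   # column of correlations with quality
with_quality <- with_quality[names(with_quality) != "quality"]

round(sort(with_quality, decreasing = TRUE), 2)
```

Sorting in decreasing order puts the positively correlated features first and the negatively correlated ones last.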
The plots above support the correlation between quality and alcohol: higher-quality wines tend to have a greater alcohol content, and alcohol has the highest correlation with quality (0.48).
This plot shows how the different acids relate to wine quality. Citric acid is more strongly correlated with quality (0.23) than fixed.acidity is (0.12), while volatile.acidity is negatively correlated with quality (-0.39).
Quality is higher for wines with low volatile acidity, since volatile acidity and quality are negatively correlated (-0.39).
Citric acid and sulphates are moderately correlated (0.31), while alcohol vs citric acid (0.110) and alcohol vs sulphates (0.094) are only weakly correlated.
pH has a very small correlation with quality (-0.058). This might suggest that higher-quality wines have a lower pH, but according to the plot most medium-quality wines also have a low pH. This may be due to outliers.
pH increases as fixed acidity decreases, since the two are negatively correlated at
## [1] -0.68
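A single coefficient like this can be obtained with `cor()` or, together with a significance test, with `cor.test()`. The sketch below uses synthetic values constructed to have a negative relationship; it does not reproduce the wine data, and the simulation parameters are assumptions.

```r
set.seed(7)

# Synthetic fixed acidity and pH with a built-in negative relationship.
fixed.acidity <- rnorm(300, mean = 8.3, sd = 1.7)
pH <- 3.31 - 0.06 * (fixed.acidity - 8.3) + rnorm(300, sd = 0.11)

r <- cor(fixed.acidity, pH)  # Pearson correlation coefficient
test <- cor.test(fixed.acidity, pH, method = "pearson")

round(r, 2)
test$p.value
```

`cor.test()` returns the same estimate as `cor()` and adds a p-value and confidence interval, which helps when deciding whether a small coefficient is distinguishable from zero.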
From the correlation matrix, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The individual correlation tests showed similar trends, except that pH showed a weaker correlation.
In the correlation matrix, fixed acidity and density are positively correlated: as fixed acidity increases, the density of the wine tends to be higher. Volatile acidity is negatively correlated with citric acid, which is an interesting observation. pH has little association with wine quality, as its correlation is small.
The strongest positive relationships are between fixed.acidity and citric.acid (correlation coefficient 0.67) and between fixed.acidity and density (0.67). The strongest relationship in absolute terms is the negative one between fixed.acidity and pH (-0.68).
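The strongest pair can also be located programmatically rather than by eye. This sketch rebuilds a 4 x 4 excerpt of the correlation matrix printed earlier (values rounded to four decimals) and searches the off-diagonal entries by absolute value; the object names are illustrative.

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH")

# Excerpt of the correlation matrix shown above, rounded to 4 decimals.
cor_mat <- matrix(c(
   1.0000,  0.6717,  0.6680, -0.6830,
   0.6717,  1.0000,  0.3649, -0.5419,
   0.6680,  0.3649,  1.0000, -0.3417,
  -0.6830, -0.5419, -0.3417,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the upper triangle so each pair is considered once.
off_diag <- cor_mat
off_diag[!upper.tri(off_diag)] <- NA

idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)
strongest   <- c(rownames(cor_mat)[idx[1, "row"]], colnames(cor_mat)[idx[1, "col"]])
strongest_r <- cor_mat[idx[1, "row"], idx[1, "col"]]

strongest
strongest_r
```

Comparing absolute values matters here, because a search for the largest signed value would skip strong negative relationships.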
The features of interest found in the bivariate plots are explored further here.
This plot suggests that both alcohol and sulphates are needed for good wine.
This plot gives a clear picture of good wine (low volatile acidity and high alcohol) and poor wine (high volatile acidity and low alcohol).
This plot contrasts good wine (high citric acid and high alcohol) with poor wine (low citric acid and low alcohol), consistent with the positive correlation between citric acid and quality (0.23).
pH doesn’t have a large impact on wine quality.
Effect of alcohol and volatile acidity on wines of extreme quality.
In the multivariate analysis, the relationship between volatile.acidity and alcohol is the most informative. The plot gives a clear picture of good wine (low volatile acidity and high alcohol) and poor wine (high volatile acidity and low alcohol).
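The pattern described above can be checked numerically by averaging both variables within each rating group. The rows below are made-up toy values chosen only to mirror that pattern; they are not taken from the dataset, and `group_means` is an illustrative name.

```r
# Made-up toy rows (assumption), not actual wine observations.
toy <- data.frame(
  rating = factor(c("low", "low", "medium", "medium", "medium", "highest", "highest"),
                  levels = c("low", "medium", "highest")),
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 9.9, 11.8, 12.3),
  volatile.acidity = c(0.88, 0.76, 0.58, 0.52, 0.60, 0.36, 0.30)
)

# Mean alcohol and volatile acidity for each rating level.
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ rating,
                         data = toy, FUN = mean)
group_means
```

On the real data, rising mean alcohol and falling mean volatile acidity from "low" to "highest" would support the reading of the plot.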
Since alcohol, specifically ethanol, is a very weak acid, it was thought that it might be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid above shows that the two are only weakly correlated. Also, because of the small range of pH (roughly 3-4), little effect on wine quality is observed.
Correlation scores:
citric.acid:quality = 0.23, fixed.acidity:quality = 0.12, residual.sugar:quality = 0.01
This plot shows that better-quality wines tend to contain more citric acid, in line with its positive correlation with quality (0.23). Low volatile acidity also contributes to higher-quality wine.
This plot shows the correlation between alcohol and quality (0.48). As this is the largest correlation with quality, alcohol is the variable most strongly associated with wine quality.
Effect of alcohol and volatile acidity on wines of extreme quality (the correlation between alcohol and volatile acidity is -0.202). It shows that high volatile acidity combined with low alcohol content keeps wine quality down, and vice versa.
The analysis began by loading the dataset and obtaining an overview of the data. The first part is a univariate analysis, in which histograms were plotted for the distributions of all the variables in the dataset. The quality variable was also converted into a factor variable with levels, which helped in analysing it.
I had difficulty analysing the plot produced with the function corrplot(), so I calculated the correlation values separately to analyse the data better. A categorical rating variable was also created. It gives each wine one of 3 grades: low (3-4), medium (5-6) and highest (7-8).
Log transformations were applied to variables such as chlorides and residual sugar because their distributions were highly skewed.
Correlation coefficients were computed to study the relationships between all variables and the effect of each variable on quality. The variables identified as having the strongest correlation with quality are alcohol, volatile acidity, sulphates and citric acid. Plots were drawn to revisit these variables and study their effects separately.
The multivariate analysis explores the interaction between variables and checks where the high-quality wines fall, in order to establish relationships.
Wine quality is highly subjective and depends on an individual's taste. A better study would include the quantities of wine sold in the market. A predictive model could also be built to predict wine quality.